Real-world Data Wrangling
In this project, you will apply the skills you acquired in the course to gather and wrangle real-world data with two datasets of your choice.
You will retrieve and extract the data, assess the data programmatically and visually across elements of data quality and structure, and implement a cleaning strategy for the data. You will then store the updated data into your selected database/data store, combine the data, and answer a research question with the datasets.
Throughout the process, you are expected to:
- Explain your decisions regarding the methods used for gathering, assessing, cleaning, storing, and answering the research question.
- Write code comments so your code is more readable.
Problem Statement
In this notebook, we aim to perform exploratory data analysis (EDA) and clean two datasets related to video game sales and Steam game statistics. The Video Game Sales dataset contains information about global sales of games across different platforms, while the Steam Games dataset contains metadata, user ratings, and gameplay statistics for games available on Steam.
The key objectives of this analysis are:
Data Cleaning and Quality Issues:
- Address missing values, inconsistent data types, and remove redundant columns.
- Ensure that the data is properly formatted to allow for accurate analysis.
Tidiness Issues:
- Split columns that contain multiple variables (e.g., platforms and categories) into separate, one-hot encoded columns to adhere to tidy data principles.
- Ensure the dataset follows a structure where each variable has its own column and each observation is in its own row.
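As a toy illustration of this splitting approach (hypothetical data, not the real columns), pandas' `str.get_dummies` can one-hot encode semicolon-separated values:

```python
import pandas as pd

# Hypothetical column holding multiple semicolon-separated values
df = pd.DataFrame({'platforms': ['windows;mac;linux', 'windows', 'windows;mac']})

# One-hot encode: one column per distinct value, 1 where present
dummies = df['platforms'].str.get_dummies(sep=';')
print(dummies)
#    linux  mac  windows
# 0      1    1        1
# 1      0    0        1
# 2      0    1        1
```

This is the same technique applied to the real `platforms` and `categories` columns later in the notebook.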
Exploratory Data Analysis (EDA):
- Analyze trends in video game sales across different regions and platforms.
- Examine user ratings, playtime statistics, and the relationship between game price and user feedback on Steam.
Through this process, we aim to gain insights into the video game market across platforms and regions, as well as understand patterns in game performance on the Steam platform.
1.2. Gather at least two datasets using two different data gathering methods
List of data gathering methods:
- Download data manually
- Download files programmatically
- Gather data by accessing APIs
- Gather and extract data from HTML files using BeautifulSoup
- Extract data from a SQL database
Each dataset must have at least two variables, and have greater than 500 data samples within each dataset.
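A quick programmatic check of these requirements might look like the following sketch (`meets_requirements` and the toy frame are illustrative, not part of the project code):

```python
import pandas as pd

def meets_requirements(df: pd.DataFrame, min_cols: int = 2, min_rows: int = 500) -> bool:
    """True if the dataset has at least min_cols variables and more than min_rows samples."""
    return df.shape[1] >= min_cols and df.shape[0] > min_rows

# Toy illustration with a 600-row, 2-column frame
toy = pd.DataFrame({'a': range(600), 'b': range(600)})
print(meets_requirements(toy))  # True
```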
For each dataset, briefly describe why you picked the dataset and the gathering method (2-3 full sentences), including the names and significance of the variables in the dataset. Show your work (e.g., if using an API to download the data, please include a snippet of your code).
Load the dataset programmatically into this notebook.
Dataset 1
Type: CSV File
Method: Manually downloaded the Video Game Sales dataset from Kaggle (vgsales.csv) using the download button.
Dataset variables:
- Rank - Ranking of overall sales
- Name - The game's name
- Platform - Platform of the game's release (e.g., PC, PS4)
- Year - Year of the game's release
- Genre - Genre of the game
- Publisher - Publisher of the game
- NA_Sales - Sales in North America (in millions)
- EU_Sales - Sales in Europe (in millions)
- JP_Sales - Sales in Japan (in millions)
- Other_Sales - Sales in the rest of the world (in millions)
- Global_Sales - Total worldwide sales (in millions)
# Ensure plotly renders inline
import plotly.io as pio
pio.renderers.default = 'notebook'
# Downloaded Manually. See vgsales.csv
import pandas as pd
# Load dataset
vgsales_df = pd.read_csv('vgsales.csv')
print(vgsales_df.head())
Rank Name Platform Year Genre Publisher \
0 1 Wii Sports Wii 2006.0 Sports Nintendo
1 2 Super Mario Bros. NES 1985.0 Platform Nintendo
2 3 Mario Kart Wii Wii 2008.0 Racing Nintendo
3 4 Wii Sports Resort Wii 2009.0 Sports Nintendo
4 5 Pokemon Red/Pokemon Blue GB 1996.0 Role-Playing Nintendo
NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
0 41.49 29.02 3.77 8.46 82.74
1 29.08 3.58 6.81 0.77 40.24
2 15.85 12.88 3.79 3.31 35.82
3 15.75 11.01 3.28 2.96 33.00
4 11.27 8.89 10.22 1.00 31.37
Dataset 2
Type: CSV File
Method: Programmatically downloaded and unzipped the file using Python (see datapull.py).
Dataset variables:
- appid - Steam's unique application identifier (distinguishes originals from rereleases/remasters)
- name - The game's name
- release_date - Date of the game's release
- english - binary flag (1 = English supported, 0 = not)
- developer - developer name
- publisher - publisher name
- platforms - available platforms for the game, semicolon-separated
- required_age - age requirement; 0 if not applicable or not available
- categories - Steam categories (e.g., Single-player, Multi-player), semicolon-separated
- genres - game genres, semicolon-separated
- steamspy_tags - community tags from SteamSpy, semicolon-separated
- achievements - number of achievements available in the game
- positive_ratings - number of positive ratings
- negative_ratings - number of negative ratings
- average_playtime - average playtime (in minutes)
- median_playtime - median playtime (in minutes)
- owners - estimated number of owners, given as a range
- price - price of the game
# Code found in datapull.py
# Load dataset
steam_df = pd.read_csv('steam.csv')
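The actual download code lives in datapull.py and is not reproduced here; a minimal sketch of the download-and-unzip approach might look like the following. The URL below is a placeholder assumption, not the real source:

```python
import io
import urllib.request
import zipfile

def unzip_bytes(data: bytes, dest_dir: str) -> None:
    """Extract a zip archive held in memory into dest_dir."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        zf.extractall(dest_dir)

def download_and_unzip(url: str, dest_dir: str = '.') -> None:
    """Download a zip archive and extract its contents."""
    with urllib.request.urlopen(url) as resp:
        unzip_bytes(resp.read(), dest_dir)

# Usage with a placeholder URL (the real source is not shown here):
# download_and_unzip('https://example.com/steam-data.zip')
```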
2. Assess data
Assess the data according to data quality and tidiness metrics using the report below.
List two data quality issues and two tidiness issues. Assess each data issue visually and programmatically, then briefly describe the issue you find. Make sure you include justifications for the methods you use for the assessment.
Quality Issue 1:
# Inspecting dataframes visually
# Print first few rows and dataframe info for each
print(vgsales_df.head())
vgsales_df.info()
print(steam_df.head())
steam_df.info()
Rank Name Platform Year Genre Publisher \
0 1 Wii Sports Wii 2006.0 Sports Nintendo
1 2 Super Mario Bros. NES 1985.0 Platform Nintendo
2 3 Mario Kart Wii Wii 2008.0 Racing Nintendo
3 4 Wii Sports Resort Wii 2009.0 Sports Nintendo
4 5 Pokemon Red/Pokemon Blue GB 1996.0 Role-Playing Nintendo
NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
0 41.49 29.02 3.77 8.46 82.74
1 29.08 3.58 6.81 0.77 40.24
2 15.85 12.88 3.79 3.31 35.82
3 15.75 11.01 3.28 2.96 33.00
4 11.27 8.89 10.22 1.00 31.37
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Rank 16598 non-null int64
1 Name 16598 non-null object
2 Platform 16598 non-null object
3 Year 16327 non-null float64
4 Genre 16598 non-null object
5 Publisher 16540 non-null object
6 NA_Sales 16598 non-null float64
7 EU_Sales 16598 non-null float64
8 JP_Sales 16598 non-null float64
9 Other_Sales 16598 non-null float64
10 Global_Sales 16598 non-null float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB
appid name release_date english developer \
0 10 Counter-Strike 2000-11-01 1 Valve
1 20 Team Fortress Classic 1999-04-01 1 Valve
2 30 Day of Defeat 2003-05-01 1 Valve
3 40 Deathmatch Classic 2001-06-01 1 Valve
4 50 Half-Life: Opposing Force 1999-11-01 1 Gearbox Software
publisher platforms required_age \
0 Valve windows;mac;linux 0
1 Valve windows;mac;linux 0
2 Valve windows;mac;linux 0
3 Valve windows;mac;linux 0
4 Valve windows;mac;linux 0
categories genres \
0 Multi-player;Online Multi-Player;Local Multi-P... Action
1 Multi-player;Online Multi-Player;Local Multi-P... Action
2 Multi-player;Valve Anti-Cheat enabled Action
3 Multi-player;Online Multi-Player;Local Multi-P... Action
4 Single-player;Multi-player;Valve Anti-Cheat en... Action
steamspy_tags achievements positive_ratings \
0 Action;FPS;Multiplayer 0 124534
1 Action;FPS;Multiplayer 0 3318
2 FPS;World War II;Multiplayer 0 3416
3 Action;FPS;Multiplayer 0 1273
4 FPS;Action;Sci-fi 0 5250
negative_ratings average_playtime median_playtime owners \
0 3339 17612 317 10000000-20000000
1 633 277 62 5000000-10000000
2 398 187 34 5000000-10000000
3 267 258 184 5000000-10000000
4 288 624 415 5000000-10000000
price
0 7.19
1 3.99
2 3.99
3 3.99
4 3.99
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27075 entries, 0 to 27074
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 appid 27075 non-null int64
1 name 27075 non-null object
2 release_date 27075 non-null object
3 english 27075 non-null int64
4 developer 27074 non-null object
5 publisher 27061 non-null object
6 platforms 27075 non-null object
7 required_age 27075 non-null int64
8 categories 27075 non-null object
9 genres 27075 non-null object
10 steamspy_tags 27075 non-null object
11 achievements 27075 non-null int64
12 positive_ratings 27075 non-null int64
13 negative_ratings 27075 non-null int64
14 average_playtime 27075 non-null int64
15 median_playtime 27075 non-null int64
16 owners 27075 non-null object
17 price 27075 non-null float64
dtypes: float64(1), int64(8), object(9)
memory usage: 3.7+ MB
# Check summary of dataset
print("\nDataset information:")
vgsales_df.info()
# Check for missing values
print("\nMissing values in each column:")
missing_data = vgsales_df.isnull().sum()
print(missing_data)
# Display basic statistics
print("\nStatistical summary:")
print(vgsales_df.describe())
# Check summary of dataset
print("\nDataset information:")
steam_df.info()
# Check for missing values
print("\nMissing values in each column:")
missing_data = steam_df.isnull().sum()
print(missing_data)
# Display basic statistics
print("\nStatistical summary:")
print(steam_df.describe())
Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Rank 16598 non-null int64
1 Name 16598 non-null object
2 Platform 16598 non-null object
3 Year 16327 non-null float64
4 Genre 16598 non-null object
5 Publisher 16540 non-null object
6 NA_Sales 16598 non-null float64
7 EU_Sales 16598 non-null float64
8 JP_Sales 16598 non-null float64
9 Other_Sales 16598 non-null float64
10 Global_Sales 16598 non-null float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB
Missing values in each column:
Rank 0
Name 0
Platform 0
Year 271
Genre 0
Publisher 58
NA_Sales 0
EU_Sales 0
JP_Sales 0
Other_Sales 0
Global_Sales 0
dtype: int64
Statistical summary:
Rank Year NA_Sales EU_Sales JP_Sales \
count 16598.000000 16327.000000 16598.000000 16598.000000 16598.000000
mean 8300.605254 2006.406443 0.264667 0.146652 0.077782
std 4791.853933 5.828981 0.816683 0.505351 0.309291
min 1.000000 1980.000000 0.000000 0.000000 0.000000
25% 4151.250000 2003.000000 0.000000 0.000000 0.000000
50% 8300.500000 2007.000000 0.080000 0.020000 0.000000
75% 12449.750000 2010.000000 0.240000 0.110000 0.040000
max 16600.000000 2020.000000 41.490000 29.020000 10.220000
Other_Sales Global_Sales
count 16598.000000 16598.000000
mean 0.048063 0.537441
std 0.188588 1.555028
min 0.000000 0.010000
25% 0.000000 0.060000
50% 0.010000 0.170000
75% 0.040000 0.470000
max 10.570000 82.740000
Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27075 entries, 0 to 27074
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 appid 27075 non-null int64
1 name 27075 non-null object
2 release_date 27075 non-null object
3 english 27075 non-null int64
4 developer 27074 non-null object
5 publisher 27061 non-null object
6 platforms 27075 non-null object
7 required_age 27075 non-null int64
8 categories 27075 non-null object
9 genres 27075 non-null object
10 steamspy_tags 27075 non-null object
11 achievements 27075 non-null int64
12 positive_ratings 27075 non-null int64
13 negative_ratings 27075 non-null int64
14 average_playtime 27075 non-null int64
15 median_playtime 27075 non-null int64
16 owners 27075 non-null object
17 price 27075 non-null float64
dtypes: float64(1), int64(8), object(9)
memory usage: 3.7+ MB
Missing values in each column:
appid 0
name 0
release_date 0
english 0
developer 1
publisher 14
platforms 0
required_age 0
categories 0
genres 0
steamspy_tags 0
achievements 0
positive_ratings 0
negative_ratings 0
average_playtime 0
median_playtime 0
owners 0
price 0
dtype: int64
Statistical summary:
appid english required_age achievements \
count 2.707500e+04 27075.000000 27075.000000 27075.000000
mean 5.962035e+05 0.981127 0.354903 45.248864
std 2.508942e+05 0.136081 2.406044 352.670281
min 1.000000e+01 0.000000 0.000000 0.000000
25% 4.012300e+05 1.000000 0.000000 0.000000
50% 5.990700e+05 1.000000 0.000000 7.000000
75% 7.987600e+05 1.000000 0.000000 23.000000
max 1.069460e+06 1.000000 18.000000 9821.000000
positive_ratings negative_ratings average_playtime median_playtime \
count 2.707500e+04 27075.000000 27075.000000 27075.00000
mean 1.000559e+03 211.027147 149.804949 146.05603
std 1.898872e+04 4284.938531 1827.038141 2353.88008
min 0.000000e+00 0.000000 0.000000 0.00000
25% 6.000000e+00 2.000000 0.000000 0.00000
50% 2.400000e+01 9.000000 0.000000 0.00000
75% 1.260000e+02 42.000000 0.000000 0.00000
max 2.644404e+06 487076.000000 190625.000000 190625.00000
price
count 27075.000000
mean 6.078193
std 7.874922
min 0.000000
25% 1.690000
50% 3.990000
75% 7.190000
max 421.990000
Data Quality Issues:
Missing Values:
- In the Video Game Sales dataset:
The Year column has 271 missing values and the Publisher column has 58 missing values.
- In the Steam Games dataset:
The developer column has 1 missing value and the publisher column has 14 missing values.
Inconsistent Data Types:
- In the Steam Games dataset, the release_date column is an object (string), but it should be a datetime type for proper time-based analysis.
- In the Video Game Sales dataset, the Year column is a float64, which should be converted to an integer or date-based format for clarity.
Tidiness Issues:
Redundant Global_Sales Column:
- In the Video Game Sales dataset, the Global_Sales column is the sum of the regional sales columns (NA_Sales, EU_Sales, JP_Sales, Other_Sales). It could be derived on demand and is therefore redundant; however, we retain it because our analysis uses total sales directly.
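The claim that Global_Sales equals the sum of the regional columns can be spot-checked programmatically; a sketch on toy rows mirroring the vgsales columns (the real data may show small rounding differences):

```python
import pandas as pd

# Toy rows mirroring the vgsales regional sales columns
toy = pd.DataFrame({
    'NA_Sales': [41.49, 29.08],
    'EU_Sales': [29.02, 3.58],
    'JP_Sales': [3.77, 6.81],
    'Other_Sales': [8.46, 0.77],
    'Global_Sales': [82.74, 40.24],
})

# Sum the regional columns and compare to the reported total,
# allowing a small tolerance for rounding
regional_sum = toy[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].sum(axis=1)
is_redundant = (regional_sum - toy['Global_Sales']).abs().le(0.01).all()
print(is_redundant)  # True
```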
Multiple Variables in a Single Column:
- In the Steam Games dataset, the platforms and categories columns contain multiple semicolon-separated values. Each should be split into individual one-hot encoded columns for proper analysis.
3. Clean data
Clean the data to resolve the four data quality and tidiness issues identified in the assessment step. Make sure you include justifications for your cleaning decisions.
After cleaning each issue, validate that the cleaning was successful using either a visual or a programmatic method.
# Load datasets
vgsales_df = pd.read_csv('vgsales.csv')
steam_df = pd.read_csv('steam.csv')
# Make copies of datasets
vgsales_copy = vgsales_df.copy()
steam_copy = steam_df.copy()
# Check first few rows to confirm data is loaded correctly
print(vgsales_copy.head())
print(steam_copy.head())
Rank Name Platform Year Genre Publisher \
0 1 Wii Sports Wii 2006.0 Sports Nintendo
1 2 Super Mario Bros. NES 1985.0 Platform Nintendo
2 3 Mario Kart Wii Wii 2008.0 Racing Nintendo
3 4 Wii Sports Resort Wii 2009.0 Sports Nintendo
4 5 Pokemon Red/Pokemon Blue GB 1996.0 Role-Playing Nintendo
NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
0 41.49 29.02 3.77 8.46 82.74
1 29.08 3.58 6.81 0.77 40.24
2 15.85 12.88 3.79 3.31 35.82
3 15.75 11.01 3.28 2.96 33.00
4 11.27 8.89 10.22 1.00 31.37
appid name release_date english developer \
0 10 Counter-Strike 2000-11-01 1 Valve
1 20 Team Fortress Classic 1999-04-01 1 Valve
2 30 Day of Defeat 2003-05-01 1 Valve
3 40 Deathmatch Classic 2001-06-01 1 Valve
4 50 Half-Life: Opposing Force 1999-11-01 1 Gearbox Software
publisher platforms required_age \
0 Valve windows;mac;linux 0
1 Valve windows;mac;linux 0
2 Valve windows;mac;linux 0
3 Valve windows;mac;linux 0
4 Valve windows;mac;linux 0
categories genres \
0 Multi-player;Online Multi-Player;Local Multi-P... Action
1 Multi-player;Online Multi-Player;Local Multi-P... Action
2 Multi-player;Valve Anti-Cheat enabled Action
3 Multi-player;Online Multi-Player;Local Multi-P... Action
4 Single-player;Multi-player;Valve Anti-Cheat en... Action
steamspy_tags achievements positive_ratings \
0 Action;FPS;Multiplayer 0 124534
1 Action;FPS;Multiplayer 0 3318
2 FPS;World War II;Multiplayer 0 3416
3 Action;FPS;Multiplayer 0 1273
4 FPS;Action;Sci-fi 0 5250
negative_ratings average_playtime median_playtime owners \
0 3339 17612 317 10000000-20000000
1 633 277 62 5000000-10000000
2 398 187 34 5000000-10000000
3 267 258 184 5000000-10000000
4 288 624 415 5000000-10000000
price
0 7.19
1 3.99
2 3.99
3 3.99
4 3.99
Quality Issues
# Handling data quality issue in vgsales_copy (Video Game Sales dataset)
# Drop 'Year' column (271 missing values and not needed for our analysis)
vgsales_copy = vgsales_copy.drop(columns=['Year'])
# Fill missing values in 'Publisher' with 'Unknown'
vgsales_copy['Publisher'] = vgsales_copy['Publisher'].fillna('Unknown')
# Ensure float columns are correctly formatted & round decimals to 2 places
sales_columns = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']
vgsales_copy[sales_columns] = vgsales_copy[sales_columns].round(2)
# Handling data quality issue in steam_copy (Steam Games dataset)
# Fill missing values in 'developer' and 'publisher' columns with 'Unknown'
steam_copy['developer'] = steam_copy['developer'].fillna('Unknown')
steam_copy['publisher'] = steam_copy['publisher'].fillna('Unknown')
# Convert 'release_date' column to datetime format
steam_copy['release_date'] = pd.to_datetime(steam_copy['release_date'], errors='coerce')
# Check for missing values in vgsales_copy
print(vgsales_copy.isnull().sum())
# Check for missing values in steam_copy
print(steam_copy.isnull().sum())
# Display dtypes to confirm correct types
print(steam_copy.dtypes)
Rank 0
Name 0
Platform 0
Genre 0
Publisher 0
NA_Sales 0
EU_Sales 0
JP_Sales 0
Other_Sales 0
Global_Sales 0
dtype: int64
appid 0
name 0
release_date 0
english 0
developer 0
publisher 0
platforms 0
required_age 0
categories 0
genres 0
steamspy_tags 0
achievements 0
positive_ratings 0
negative_ratings 0
average_playtime 0
median_playtime 0
owners 0
price 0
dtype: int64
appid int64
name object
release_date datetime64[ns]
english int64
developer object
publisher object
platforms object
required_age int64
categories object
genres object
steamspy_tags object
achievements int64
positive_ratings int64
negative_ratings int64
average_playtime int64
median_playtime int64
owners object
price float64
dtype: object
Justification: All missing values have been handled and release_date is now a datetime64 type, so the data formats match their intended use and no nulls remain.
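The validation could also be made explicit with assertions; a sketch under the assumption that the cleaning cells above have run (the toy frame here stands in for the cleaned steam_copy):

```python
import pandas as pd

def validate_cleaning(df: pd.DataFrame, datetime_cols=()) -> None:
    """Raise AssertionError if any nulls remain or a column is not datetime."""
    assert df.isnull().sum().sum() == 0, "missing values remain"
    for col in datetime_cols:
        assert pd.api.types.is_datetime64_any_dtype(df[col]), f"{col} is not datetime"

# Hypothetical frame standing in for the cleaned steam_copy
toy = pd.DataFrame({'publisher': ['Valve'],
                    'release_date': pd.to_datetime(['2000-11-01'])})
validate_cleaning(toy, datetime_cols=['release_date'])
print('validation passed')
```

Note that pd.to_datetime with errors='coerce' can silently introduce NaT values, which the null check above would catch.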
Tidiness Issues
# Addressing tidiness issue #2: Splitting 'platforms' and 'categories' columns in steam_copy
# Splitting 'platforms' into separate columns
platforms_expanded = steam_copy['platforms'].str.get_dummies(sep=';')
steam_copy = pd.concat([steam_copy, platforms_expanded], axis=1)
# Splitting 'categories' into separate columns
categories_expanded = steam_copy['categories'].str.get_dummies(sep=';')
steam_copy = pd.concat([steam_copy, categories_expanded], axis=1)
# Drop original 'platforms' and 'categories' columns since they have been expanded and are now redundant
steam_copy = steam_copy.drop(columns=['platforms', 'categories'])
# Checking modified dataframes
print(vgsales_copy.head())
print(steam_copy.head())
Rank Name Platform Genre Publisher NA_Sales \
0 1 Wii Sports Wii Sports Nintendo 41.49
1 2 Super Mario Bros. NES Platform Nintendo 29.08
2 3 Mario Kart Wii Wii Racing Nintendo 15.85
3 4 Wii Sports Resort Wii Sports Nintendo 15.75
4 5 Pokemon Red/Pokemon Blue GB Role-Playing Nintendo 11.27
EU_Sales JP_Sales Other_Sales Global_Sales
0 29.02 3.77 8.46 82.74
1 3.58 6.81 0.77 40.24
2 12.88 3.79 3.31 35.82
3 11.01 3.28 2.96 33.00
4 8.89 10.22 1.00 31.37
appid name release_date english developer \
0 10 Counter-Strike 2000-11-01 1 Valve
1 20 Team Fortress Classic 1999-04-01 1 Valve
2 30 Day of Defeat 2003-05-01 1 Valve
3 40 Deathmatch Classic 2001-06-01 1 Valve
4 50 Half-Life: Opposing Force 1999-11-01 1 Gearbox Software
publisher required_age genres steamspy_tags achievements \
0 Valve 0 Action Action;FPS;Multiplayer 0
1 Valve 0 Action Action;FPS;Multiplayer 0
2 Valve 0 Action FPS;World War II;Multiplayer 0
3 Valve 0 Action Action;FPS;Multiplayer 0
4 Valve 0 Action FPS;Action;Sci-fi 0
... Stats Steam Achievements Steam Cloud Steam Leaderboards \
0 ... 0 0 0 0
1 ... 0 0 0 0
2 ... 0 0 0 0
3 ... 0 0 0 0
4 ... 0 0 0 0
Steam Trading Cards Steam Turn Notifications Steam Workshop \
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
SteamVR Collectibles VR Support Valve Anti-Cheat enabled
0 0 0 1
1 0 0 1
2 0 0 1
3 0 0 1
4 0 0 1
[5 rows x 48 columns]
5. Answer the research question
5.1: Define and answer the research question
Going back to the problem statement in step 1, use the cleaned data to answer the question you raised. Produce at least two visualizations using the cleaned data and explain how they help you answer the question.
Research questions:
- What are the trends in video game sales across different regions, and which publishers perform best in each?
- How do user ratings, playtime statistics, and game price relate to user feedback and revenue on Steam?
# Filter vgsales_copy dataset to only include rows where Platform is 'PC'
vgsales_pc = vgsales_copy[vgsales_copy['Platform'] == 'PC']
# Inner join on 'name' from steam_copy and 'Name' from vgsales_copy
merged_df = pd.merge(steam_copy, vgsales_pc, left_on='name', right_on='Name', how='inner')
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
# Group sales data by Publisher and region
publisher_sales_by_region = merged_df.groupby('Publisher')[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].sum()
# Find best performing publisher for each
best_publishers_na = publisher_sales_by_region['NA_Sales'].nlargest(5)
best_publishers_eu = publisher_sales_by_region['EU_Sales'].nlargest(5)
best_publishers_jp = publisher_sales_by_region['JP_Sales'].nlargest(5)
best_publishers_other = publisher_sales_by_region['Other_Sales'].nlargest(5)
# Plot bar charts for top publishers in each
fig_na = px.bar(best_publishers_na, x=best_publishers_na.index, y='NA_Sales',
title="Top 5 Publishers in North America by Sales",
labels={'NA_Sales': 'Sales in Millions'}, color='NA_Sales')
fig_eu = px.bar(best_publishers_eu, x=best_publishers_eu.index, y='EU_Sales',
title="Top 5 Publishers in Europe by Sales",
labels={'EU_Sales': 'Sales in Millions'}, color='EU_Sales')
fig_jp = px.bar(best_publishers_jp, x=best_publishers_jp.index, y='JP_Sales',
title="Top 5 Publishers in Japan by Sales",
labels={'JP_Sales': 'Sales in Millions'}, color='JP_Sales')
fig_other = px.bar(best_publishers_other, x=best_publishers_other.index, y='Other_Sales',
title="Top 5 Publishers in Other Regions by Sales",
labels={'Other_Sales': 'Sales in Millions'}, color='Other_Sales')
# Show plots
fig_na.show()
fig_eu.show()
fig_jp.show()
fig_other.show()
# Calculate total sales by region
region_sales = {
'Region': ['North America', 'Europe', 'Japan', 'Other'],
'Sales': [merged_df['NA_Sales'].sum(),
merged_df['EU_Sales'].sum(),
merged_df['JP_Sales'].sum(),
merged_df['Other_Sales'].sum()]
}
# Create a DataFrame from region sales data
region_sales_df = pd.DataFrame(region_sales)
# Pie chart showing total revenue by region
fig_pie_revenue_by_region = px.pie(region_sales_df, values='Sales', names='Region',
title="Total Revenue Distribution by Region")
# Show pie chart
fig_pie_revenue_by_region.show()
The graphs above illustrate each publisher's success by region. While each region appears to prefer certain publishers, this could also reflect other factors, such as regulations or the markets in which publishers operate. About half of all sales occur in North America, making it the largest consumer market. However, this figure could be misleading without accounting for each region's population and overall market share.
# Group by Publisher and sum positive and negative ratings
ratings_by_publisher = merged_df.groupby('Publisher')[['positive_ratings', 'negative_ratings']].sum().reset_index()
# Calculate ratio of positive to negative ratings, handle division by zero
ratings_by_publisher['rating_ratio'] = ratings_by_publisher['positive_ratings'] / ratings_by_publisher['negative_ratings'].replace(0, 1)
# Sort by ratio of positive to negative ratings
ratings_by_publisher = ratings_by_publisher.sort_values(by='rating_ratio', ascending=False)
# Plot bar chart for ratio of positive to negative ratings by publisher
fig_ratio = px.bar(ratings_by_publisher, x='Publisher', y='rating_ratio',
title="Positive to Negative Ratings Ratio by Publisher",
labels={'rating_ratio': 'Positive/Negative Rating Ratio', 'Publisher': 'Publisher'},
color='rating_ratio')
# Customize layout to avoid overlapping labels
fig_ratio.update_layout(xaxis_tickangle=-45, xaxis_title="Publisher", yaxis_title="Positive to Negative Rating Ratio")
# Show plot for positive-to-negative rating ratio
fig_ratio.show()
# Now, create a chart for revenue (global sales) of each publisher
# Group by Publisher and sum Global Sales (Revenue) column
revenue_by_publisher = vgsales_pc.groupby('Publisher')['Global_Sales'].sum().reset_index()
# Filter to include only publishers with over 0.75 million in revenue
revenue_by_publisher_filtered = revenue_by_publisher[revenue_by_publisher['Global_Sales'] > 0.75]
# Sort by total Global Sales
revenue_by_publisher_filtered = revenue_by_publisher_filtered.sort_values(by='Global_Sales', ascending=False)
# Plot bar chart for revenue by publisher
fig_revenue = px.bar(revenue_by_publisher_filtered, x='Publisher', y='Global_Sales',
title="Revenue (Global Sales) by Publisher (over 0.75 million)",
labels={'Global_Sales': 'Revenue (in Millions)', 'Publisher': 'Publisher'},
color='Global_Sales')
# Customize layout to avoid overlapping labels
fig_revenue.update_layout(xaxis_tickangle=-45, xaxis_title="Publisher", yaxis_title="Revenue (in Millions)")
# Show plot for revenue by publisher
fig_revenue.show()
# Now, let's add a graph for average revenue per game by publisher
# Group by Publisher, sum Global Sales, and count number of games
publisher_stats = vgsales_pc.groupby('Publisher').agg(
total_revenue=('Global_Sales', 'sum'),
num_games=('Name', 'count')
).reset_index()
# Calculate average revenue per game
publisher_stats['avg_revenue_per_game'] = publisher_stats['total_revenue'] / publisher_stats['num_games']
# Filter publishers with average revenue per game greater than 0.3 million
publisher_stats_filtered = publisher_stats[publisher_stats['avg_revenue_per_game'] > 0.3]
# Sort by average revenue per game in descending order
publisher_stats_filtered = publisher_stats_filtered.sort_values(by='avg_revenue_per_game', ascending=False)
# Plot bar chart for average revenue per game by publisher
fig_avg_revenue = px.bar(
publisher_stats_filtered,
x='Publisher',
y='avg_revenue_per_game',
title="Average Revenue per Game by Publisher (over 0.3 million)",
labels={'avg_revenue_per_game': 'Average Revenue per Game (in Millions)', 'Publisher': 'Publisher'},
color='avg_revenue_per_game'
)
# Customize layout to avoid overlapping labels
fig_avg_revenue.update_layout(xaxis_tickangle=-45, xaxis_title="Publisher", yaxis_title="Average Revenue per Game (in Millions)")
# Show plot for average revenue per game by publisher
fig_avg_revenue.show()
# Optional: Log-log analysis for correlation between ratings and revenue as in previous code
# Apply log transformation to positive and negative ratings and global sales
merged_df['log_positive_ratings'] = np.log1p(merged_df['positive_ratings'])
merged_df['log_negative_ratings'] = np.log1p(merged_df['negative_ratings'])
merged_df['log_global_sales'] = np.log1p(merged_df['Global_Sales'])
# Scatter Plot 1: Log of Positive Ratings vs Log of Global Sales
fig_loglog_positive_sales = px.scatter(merged_df, x='log_positive_ratings', y='log_global_sales',
title="Log-Log Correlation Between Positive Ratings and Global Sales",
labels={'log_positive_ratings': 'Log of Positive Ratings', 'log_global_sales': 'Log of Revenue'},
trendline="ols", # Add a trendline to visualize correlation
color='log_positive_ratings')
# Show log-log scatter plot with a trendline for Positive Ratings vs Sales
fig_loglog_positive_sales.show()
# Scatter Plot 2: Log of Negative Ratings vs Log of Global Sales
fig_loglog_negative_sales = px.scatter(merged_df, x='log_negative_ratings', y='log_global_sales',
title="Log-Log Correlation Between Negative Ratings and Global Sales",
labels={'log_negative_ratings': 'Log of Negative Ratings', 'log_global_sales': 'Log of Revenue'},
trendline="ols", # Add a trendline to visualize correlation
color='log_negative_ratings')
# Show log-log scatter plot with a trendline for Negative Ratings vs Sales
fig_loglog_negative_sales.show()
# Calculate and print correlation coefficients for log-log values
correlation_loglog_positive = merged_df[['log_positive_ratings', 'log_global_sales']].corr().iloc[0, 1]
correlation_loglog_negative = merged_df[['log_negative_ratings', 'log_global_sales']].corr().iloc[0, 1]
print(f"Correlation between log of positive ratings and log of revenue: {correlation_loglog_positive:.2f}")
print(f"Correlation between log of negative ratings and log of revenue: {correlation_loglog_negative:.2f}")
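np.log1p is used above rather than np.log because the ratings and sales columns contain zeros; a small sketch of the difference:

```python
import numpy as np

counts = np.array([0, 9, 99])

# log1p computes log(1 + x), mapping 0 -> 0 instead of -inf
print(np.log1p(counts))

# plain log produces -inf at zero, which would break the correlation
with np.errstate(divide='ignore'):
    print(np.isinf(np.log(counts)).any())  # True
```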
# Group by Publisher, sum Global Sales, and count number of games
publisher_stats = vgsales_pc.groupby('Publisher').agg(
total_revenue=('Global_Sales', 'sum'),
num_games=('Name', 'count')
).reset_index()
# Calculate average revenue per game
publisher_stats['avg_revenue_per_game'] = publisher_stats['total_revenue'] / publisher_stats['num_games']
# Group by Publisher and sum positive and negative ratings
ratings_by_publisher = merged_df.groupby('Publisher')[['positive_ratings', 'negative_ratings']].sum().reset_index()
# Calculate ratio of positive to negative ratings, handling division by zero
ratings_by_publisher['rating_ratio'] = ratings_by_publisher['positive_ratings'] / ratings_by_publisher['negative_ratings'].replace(0, 1)
# Merge publisher stats with ratings data
merged_publisher_stats = pd.merge(publisher_stats, ratings_by_publisher, on='Publisher')
# Apply a logarithmic transformation to positive to negative rating ratio
merged_publisher_stats['log_rating_ratio'] = np.log1p(merged_publisher_stats['rating_ratio'])
# Scatter plot: Average revenue per game vs. Log of Positive to Negative rating ratio
fig_scatter = px.scatter(
merged_publisher_stats,
x='log_rating_ratio',
y='avg_revenue_per_game',
title="Average Revenue per Game vs Log of Positive to Negative Rating Ratio by Publisher",
labels={'log_rating_ratio': 'Log of Positive/Negative Rating Ratio', 'avg_revenue_per_game': 'Average Revenue per Game (in Millions)'},
color='avg_revenue_per_game', # Color by average revenue
hover_data=['Publisher'], # Show publisher info on hover
trendline='ols' # Add a trendline for linear regression
)
# Show scatter plot
fig_scatter.show()
# Calculate R² value using OLS (Ordinary Least Squares) trendline
results = px.get_trendline_results(fig_scatter)
ols_result = results.iloc[0]["px_fit_results"]
# Get R² value from OLS result
r_squared = ols_result.rsquared
print(f"R² value: {r_squared:.4f}")
# Calculate positive to negative rating ratio
merged_df['rating_ratio'] = merged_df['positive_ratings'] / merged_df['negative_ratings'].replace(0, 1)
# Filter necessary columns (Revenue, Time Played, Rating Ratio)
correlation_data = merged_df[['Global_Sales', 'average_playtime', 'rating_ratio']]
# Calculate correlation matrix
correlation_matrix = correlation_data.corr()
# Plot correlation heatmap using seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Matrix: Revenue vs Time Played vs Rating Ratio")
plt.show()
# Group vgsales_copy dataset by Genre and sum Global Sales
sales_by_genre = vgsales_copy.groupby('Genre')['Global_Sales'].sum().reset_index()
# Sort genres by total global sales in descending order
sales_by_genre = sales_by_genre.sort_values(by='Global_Sales', ascending=False)
# Plot bar chart for total sales by genre
fig_genre_sales = px.bar(
sales_by_genre,
x='Genre',
y='Global_Sales',
title="Total Global Sales by Genre",
labels={'Global_Sales': 'Total Sales (in Millions)', 'Genre': 'Genre'},
color='Global_Sales'
)
# Customize layout to avoid overlapping labels
fig_genre_sales.update_layout(xaxis_tickangle=-45)
# Show plot
fig_genre_sales.show()
# Group the vgsales_copy dataset by Genre, calculate the total sales and count the number of games
genre_stats = vgsales_copy.groupby('Genre').agg(
total_sales=('Global_Sales', 'sum'),
num_games=('Name', 'count')
).reset_index()
# Calculate the average revenue per game for each genre
genre_stats['avg_revenue_per_game'] = genre_stats['total_sales'] / genre_stats['num_games']
# Sort the genres by average revenue per game in descending order
genre_stats = genre_stats.sort_values(by='avg_revenue_per_game', ascending=False)
# Plot the bar chart for average revenue per game by genre
fig_avg_revenue_genre = px.bar(
genre_stats,
x='Genre',
y='avg_revenue_per_game',
title="Average Revenue per Game by Genre",
labels={'avg_revenue_per_game': 'Average Revenue per Game (in Millions)', 'Genre': 'Genre'},
color='avg_revenue_per_game'
)
# Customize the layout to avoid overlapping labels
fig_avg_revenue_genre.update_layout(xaxis_tickangle=-45)
# Show the plot
fig_avg_revenue_genre.show()
Correlation between log of positive ratings and log of revenue: 0.48
Correlation between log of negative ratings and log of revenue: 0.42
R² value: 0.0655
From the graphs above, there are no strong correlations. It is difficult to confirm without additional data, but stronger predictors appear to be customer loyalty and genre preference, both of which may tie back to the publishing studio. The Platform and Shooter genres seem to perform well, though this too could be skewed by customer loyalty.
5.2: Reflection
In 2-4 sentences, if you had more time to complete the project, what actions would you take? For example, which data quality and structural issues would you look into further, and what research questions would you further explore?
Answer: I would seek additional datasets, including populations and surveys, to better understand the demographics of gamers and how they influence game sales across regions. I would also explore deeper data quality issues, such as missing or incomplete data for specific games and genres, and ensure that outliers in sales or playtime are properly handled. Additionally, I would investigate other factors influencing sales, such as marketing spend, game reviews, and platform popularity, to gain a comprehensive understanding of what drives revenue.